Abstract

The MPEG Encoded Retrieval and Indexation Toolkit (MERIT) performs video segmentation
of MPEG files in the compressed domain, using an algorithm based on macroblock
type statistics. It was written in C by V. Kobla in 1995 for his doctoral work at the
University of Maryland. Kobla's code dealt with MPEG-1 files only. In this report we update
MERIT to analyze MPEG-2 files as well. We modify the file parsing process to account
for the MPEG-2 specifications. We account for the new motion compensation modes
introduced by MPEG-2 in preliminary computations, generating data structures that allow
the original MERIT segmentation algorithm to work properly. A series of tests confirmed
the validity of our solutions. The new version 4.0 of MERIT is a superset of MERIT 3.3,
insofar as it gives the same results for MPEG-1 files, and is able to analyze MPEG-2 files
using almost all the available options. Further improvements could be made to address
key-frame storage and higher chrominance formats.

Keywords: video segmentation, MERIT, MPEG specifications, macroblock, motion
compensation.


1  Overview 


MERIT (MPEG Encoded Retrieval and Indexing Toolkit [2, 11]) is a package that performs
various MPEG-based analyses, such as video segmentation [5, 10], motion analysis,
and extraction of estimated DC coefficients or of flow vector information. It was written in
C by V. Kobla for his doctoral work at the University of Maryland. Video segmentation
consists of identifying breaks or cuts in an MPEG-encoded video clip, thus dividing the
video into shots. A shot can be defined as a minimal sequence of frames resulting from
a continuous uninterrupted recording by an input device such as a camera. In MERIT, the
analysis is performed in the compressed domain using available macroblock and motion
vector information, and, if necessary, DCT information. When MERIT is run without
any options and with just the MPEG filename, it prints to stdout the list of key-frame
numbers, which are the first frames of the segmented shots. These key-frames can be used
in further applications, such as archiving or indexing of video sequences [3, 4]. Options
available with MERIT will be described in Section 2.2.

MPEG is a digital compression standard for both video and audio, developed by the
ISO MPEG committee. ISO is developing a family of MPEG standards. The group
produced MPEG-1, the standard for Video CD and MP3; MPEG-2, the standard for
Digital Television set top boxes and DVD; and MPEG-4, the standard for multimedia on
the Web. It is also developing the MPEG-7 “Multimedia Content Description Interface”.

Until now, MERIT could extract information only from MPEG-1 encoded files. An
error would occur if one tried to read an MPEG-2 encoded file with MERIT. The main
goal of this project was to develop a new version of MERIT that is able to
process MPEG-2 encoded files. MPEG-2 is a superset of MPEG-1, which means
that the MPEG-2 standard includes the MPEG-1 standard. Thus, an MPEG-2 version
of MERIT works with MPEG-1 files as well.

In this report, the current MERIT 3.3 program is first outlined in detail. A description
of the specific MPEG syntax used in this report can be found in Appendix A. Then,
we describe what needs to be removed or changed in Version 3.3 of MERIT. Finally,
we report on the implementation of the new version, the problems encountered, and
their solutions.

2  MERIT  3.3 

2.1  System  overview 

Compressed video in MPEG format is first segmented into shots (continuous uninterrupted
sequences of frames). The segmented video is then passed on to a motion vector
analysis routine where global camera motion information is extracted. Using the camera
motion information, shots are subdivided into subshots. Finally, key-frames are extracted
from these shots and subshots.

A  shot  change  usually  occurs  abruptly  between  two  frames.  Such  a  shot  change 
is  called  a  cut.  Cut  detection  is  performed  using  the  macroblock  (MB)  type.  If  a  P- 
frame  contains  primarily  intra-coded  MBs,  this  suggests  that  these  MBs  could  not  be 
predicted from the previous reference frame, and a cut may have occurred between the
previous I- or P-frame and the current P-frame. If in a B-frame a majority of MBs are
forward-predicted  from  the  previous  I-  or  P-frame,  there  is  a  high  probability  that  a  shot 
change  has  occurred  between  the  current  frame  and  the  next  I-  or  P-frame.  Similarly,  if 
a  majority  of  MBs  in  a  B-frame  are  backward-predicted,  there  is  a  high  probability  that 
a  shot  change  has  occurred  between  the  previous  I-  or  P-frame  and  the  current  frame. 
If  there  is  a  shot  change,  all  B-frames  before  the  cut  must  have  a  majority  of  forward- 
predicted  MBs,  and  all  B-frames  after  the  cut  must  have  a  majority  of  backward-predicted 
MBs.  Tests  for  the  presence  of  these  features  are  employed  in  the  detection  process. 
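
The macroblock-type rules above can be illustrated with a minimal C sketch. It is not taken from the MERIT sources: the `MBCounts` and `CutHint` types and the `classify_frame()` function are hypothetical names, and the strict-majority test (more than half of the coded MBs) is an assumption standing in for whatever thresholds MERIT actually applies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-frame macroblock counts (not MERIT's Info structure). */
typedef struct {
    char type;   /* picture type: 'I', 'P', or 'B' */
    int  intra;  /* intra-coded MBs */
    int  forw;   /* forward-predicted MBs */
    int  back;   /* backward-predicted MBs */
    int  bidir;  /* bidirectionally predicted MBs */
} MBCounts;

typedef enum {
    NO_HINT,           /* statistics do not suggest a cut */
    CUT_BEFORE_FRAME,  /* cut between the previous I/P-frame and this frame */
    CUT_AFTER_FRAME    /* cut between this frame and the next I/P-frame */
} CutHint;

/* Return a cut hint based on which MB type holds a strict majority.
 * The 50% majority is an assumed threshold for illustration only. */
CutHint classify_frame(const MBCounts *f)
{
    int total;

    assert(f != NULL);
    total = f->intra + f->forw + f->back + f->bidir;
    if (total == 0)
        return NO_HINT;

    /* P-frame dominated by intra MBs: prediction from the past failed. */
    if (f->type == 'P' && 2 * f->intra > total)
        return CUT_BEFORE_FRAME;

    if (f->type == 'B') {
        /* Mostly forward-predicted: the future reference is unrelated. */
        if (2 * f->forw > total)
            return CUT_AFTER_FRAME;
        /* Mostly backward-predicted: the past reference is unrelated. */
        if (2 * f->back > total)
            return CUT_BEFORE_FRAME;
    }
    return NO_HINT;
}
```

A detector built on such hints would still need to check consistency across all B-frames between two reference frames, as described above.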

When the macroblock information is found to be inconclusive, the algorithm uses the
DCT coefficients to confirm the existence of shot changes. After the shots have been
found, MERIT uses the motion vector information to find subshots, or scene changes. A
paper written by V. Kobla [5] provides details about this algorithm.

2.2  Usage 

Many options are available with MERIT [2]. When MERIT is run without any options
and with just the MPEG filename, it prints to stdout the list of key-frame numbers,
which are the first frames of the segments defined by the breaks. Depending on the
option, the algorithm uses the I-frames for segmentation analysis (-dct), or performs
DCT validation for certain (-valid) or all (-full) detected cuts. The motion analysis
can also be turned on (-motion). A display window can provide graphic information
about the MB types and motion vectors (-dispMV, -dispFlow). Moreover, several types
of information can be printed or saved in a file. Finally, one can choose the start-frame
and the end-frame numbers (-start, -end).

2.3  General  architecture 

MERIT comprises two main components: bit stream parsing and video segmentation.
The first component parses the MPEG file and collects the relevant information in specific
structures. The second component uses this information to perform video segmentation.
As we will see, these two parts are quite independent.

2.3.1  MPEG  bit  stream  parsing 

The main() function is located in the MERIT.c file. This function processes the options
and calls Parseblkfile() or ParseDctFile(), both located in the Parse.c file. These
two functions are intermediate functions, insofar as they only set options for another
function that they call, mpeg_stat(), located in main.c.

In fact, mpeg_stat() is the function which parses the stream. It takes two arguments.
The first, mpegfilename, is the name of the file to be parsed. The second, dct, is a
flag indicating whether the DCT coefficients are needed (1) or not (0). The function
mpeg_stat() corresponds to the main() function of another program, called mpeg_stat
and written by Steve Smoot at U. C. Berkeley. The MERIT program includes all files
from mpeg_stat, some of which have been adapted by V. Kobla (filter.c, main.c,
parseblock.c, proto.h, util.c, video.c, video.h). These imported files can easily be
distinguished from the MERIT-specific files, since they begin with a lower-case letter (e.g.
video.c), whereas the MERIT-specific files begin with a capital letter (e.g. Parse.c).

The main purpose of the original mpeg_stat is to provide statistics about the features
of an MPEG-1 encoded file. It is therefore mainly a parsing program, and MERIT uses it
as such. A detailed description of mpeg_stat is provided below. The parsing function
of MERIT is performed entirely by the mpeg_stat code, which only uses some global
variables shared with other parts (defined in Global.h and List.h). The mpeg_stat code
has been modified to fill in specific global structures with information, instead of passing
it through files, as in the original version.

2.3.2  Storing  MPEG  information 

In order to run the segmentation algorithm, MERIT needs some preliminary information
from the MPEG file. Here is a short description of the structures in which the function
mpeg_stat() stores the information. These structures are declared and defined in
Global.h, List.h and List.c.

•  List.h

This file declares the Line structure as follows:

typedef struct LINE
{
    int LineType, blockType, frameNo, qscale, numBits, blockNo;
    char frameType, *dctSpecs;
    MV *forw, *back;
    struct LINE *next;
} Line;

LineType is the type of the Line, which can be BLOCK (value: 10), SLICE (11),
FRAME (12) or GOP (for Group Of Pictures, value: 13).

blockType is the type of the block, if the LineType is BLOCK. Possible values:
INTRA (200), FORW (201), BI (202), BACK (203), or SKIP (204).

frameNo is the number of the current frame.

qscale is the quantization scale used for the DCT.

numBits is the total number of bits used for a particular MB.

blockNo is the number of the macroblock, when needed (if LineType = BLOCK
only).

frameType is the type of the frame, when needed ('I', 'P' or 'B', if LineType =
FRAME only).

dctSpecs is a string with all the DC and AC coefficients for each block of a
macroblock (if LineType = BLOCK only).

forw is the forward motion vector (MV is a structure with the x and y coordinates),
when needed (if LineType = BLOCK only).

back is the backward motion vector, when needed (if LineType = BLOCK only).

Initially, the Line information was to be an element of a linked list, as indicated
by the presence of a pointer to the next Line. The function add_node() adds a
new empty current node to the list. In practice, however, the program does not use this
list structure. The function add_node() is never called; the information is stored
in a current Line and immediately processed, before the current Line is refreshed
(refresh_node()) and new information is stored.

•  List.c

This file defines head and current as pointers to Line, and defines the function
add_node() (never called), and the function refresh_node(). This function resets
the Line node pointed to by current by setting all the integers or char to 0 and
the pointers to NULL. If the current pointer is null, it creates an empty structure
for Line and stores its address in current. The head pointer is never used, since
only the refresh_node() function is called. current points to the Line structure
where all the needed information found in the MPEG bit stream is stored.
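
The behaviour described for refresh_node() can be sketched in C as follows. This is a reconstruction from the description above, not the code in List.c: the layout of `MV` is assumed from its description (x and y coordinates), and whether the real function frees the strings and vectors it detaches is not stated in this report, so the sketch simply clears the pointers.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed layout: the report only says MV holds x and y coordinates. */
typedef struct { int x, y; } MV;

typedef struct LINE {
    int LineType, blockType, frameNo, qscale, numBits, blockNo;
    char frameType, *dctSpecs;
    MV *forw, *back;
    struct LINE *next;
} Line;

Line *current = NULL;

/* Reset the node pointed to by current, allocating it on first use.
 * Sketch only: allocation failure is treated as fatal via assert(),
 * and previously attached buffers are not freed here. */
void refresh_node(void)
{
    if (current == NULL) {
        current = (Line *)malloc(sizeof(Line));
        assert(current != NULL);
    }
    /* All integers and chars go back to 0 ... */
    current->LineType = current->blockType = current->frameNo = 0;
    current->qscale = current->numBits = current->blockNo = 0;
    current->frameType = 0;
    /* ... and all pointers to NULL. */
    current->dctSpecs = NULL;
    current->forw = NULL;
    current->back = NULL;
    current->next = NULL;
}
```

Because the same node is reused for every parsed element, the information it holds must be consumed before the next call.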

•  Global.h

Global.h defines some helpful integer constants for the understanding of the code
(such as BLOCK = 10, SKIP = 204...). It also declares the function mpeg_stat(),
and some useful structures, such as MV (motion vector, with the x and y coordinates
in half-pixels; brightness values between pixels are computed using bilinear
interpolation), flMV (float motion vector, for averaging). Furthermore, Global.h defines
some structures where the information is stored just after parsing and temporary
storage in the current Line structure. The most important structures are Info
and InfoList:

typedef struct INFO
{
    int dispFrameNo;
    char frameType;
    int value;
    int numIntra, numForw, numBack, numBidir, numSkip;
    int numSkipForw, numSkipBack, numSkipBidir;
    MV *forwMVs;
    MV *backMVs;
    MV *domMV;
    flMV *meanMV;
    int avgAngle;
    char *MBTypes;
} Info;

typedef struct INFOLIST
{
    Info *info;
    int frameNo;
    struct INFOLIST *next;
} InfoList;

As we see, all the information about a frame is stored in the Info structure (frame
number, frame type, number of Intra-coded MBs, number of Forward-, Backward-,
and Bidirectionally-predicted MBs, number of all types of Skipped MBs, ...).

This Info for each frame is then put into a list structure (InfoList).

The main difference between the Line structure and the Info structure is that Line
can be used to describe different levels of objects in a video stream, such as Group of
Pictures, Pictures, Slices, MBs and Blocks, depending on the LineType. On the other
hand, Info contains information for a whole frame (picture), such as the frame type
and the number of macroblocks of each type. This structure is used in the segmentation
algorithm, which needs to compare the features of successive frames.

Other structures defined in Global.h (MotionInfo, SceneInfo, RangeList, ...) are
used in the segmentation algorithm. However, we don't need to study these structures,
since our upgrade of MERIT only required a modification of the parsing component.
To insert the information stored in the current Line into Info, and then into the
InfoList structure, MERIT uses functions mainly referred to as initInfo(), and defined
in Parse.c. initInfo() has one argument, a pointer to a Line structure. Essentially,
initInfo() deals with current. After each parsing by mpeg_stat (that is, when current
is full), a call to initInfo(current) puts the information into the InfoList structure.
Then, current is refreshed.

There are three initInfo() functions: initInfoBlkMode(), initInfoDctMode(), and
initInfoDCTValidMode(). Only one of them is used each time MERIT is running,
depending on the current mode (options of the program).

2.3.3  Video  segmentation 

When  the  parsing  is  completed,  the  video  segmentation  part  of  MERIT  begins  its  work. 
In  fact,  the  segmentation  is  totally  independent  of  the  parsing,  once  the  MPEG  bit  stream 
has  been  read.  Here  is  a  description  of  all  the  information  MERIT  3.3  currently  needs 
for  the  video  segmentation: 

•  Video Stream level:

   - horizontal picture size in pixels (h_size)

   - vertical picture size in pixels (v_size)

   - horizontal picture size in MBs (mb_width)

   - vertical picture size in MBs (mb_height)

   - total number of frames in the stream (totalFrames)

   - total number of I-frames (numIFrames)

•  Frame level:

   - frame number (display order)

   - frame type ('I', 'P', or 'B')

•  Macroblock level:

   - macroblock number in the frame (0 to mb_width × mb_height - 1)

   - macroblock type (FORW, BACK, INTRA, BI, 0 (forward and no motion vector),
     SKIP)

   - macroblock motion vector(s)

   - macroblock quantization scale

   - number of bits used for this MB (numBits)

•  Block level (only if the DCT validation is needed):

   - DCT quantized DC coefficient (0 to 2040, for intra-coded MBs only)

   - DCT quantized AC coefficients: position in matrix (0 to 63) and level (-2047
     to 2047) of non-zero coefficients

Four modes of segmentation are provided by MERIT: blockinfo mode (MERIT's
simplest use, without any option), dct mode (-dct), validation mode (-valid), and full
mode (-full).

When running in blockinfo mode, MERIT uses only the MBType information. For
each frame a variable called value is computed. Which MB type is counted depends on
the picture coding type (I, P, or B), but value is always the number of MBs in the frame
that are of that MB type. Therefore value ranges between 0 and the maximum
number of MBs in the frame. Usually, value is high for pictures very similar to the
previous reference frame(s), and low for pictures with no similarities to the previous
reference frame(s). An algorithm then uses this value to find the cuts.

With the dct mode, MERIT only uses the I-frames. This mode is useful for streams
in XING format (only I-pictures). For each block of a specific I-frame, the DC coefficient
(first and dominating coefficient of the cosine transform of the block) is compared to
the DC coefficient of the same block in the previous I-frame. If the difference is higher
than a fixed threshold, a variable called sdd is increased. In the end, sdd is the number of
non-similar blocks of two consecutive I-frames, and reflects the difference between these
I-frames, and therefore the likelihood of a cut between these frames.
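
The sdd computation can be sketched in a few lines of C. The function name `count_dissimilar_blocks()` and its parameters are hypothetical, and the sketch assumes the DC coefficients of both I-frames are already available as flat arrays in the same block order; the threshold value itself is not given in this report.

```c
#include <assert.h>
#include <stdlib.h>

/* Count the blocks whose DC coefficient differs by more than threshold
 * between two I-frames that contain the same number of blocks.
 * prev_dc[i] and cur_dc[i] must refer to the same block position. */
int count_dissimilar_blocks(const int *prev_dc, const int *cur_dc,
                            int num_blocks, int threshold)
{
    int i;
    int sdd = 0;

    assert(prev_dc != NULL && cur_dc != NULL && num_blocks >= 0);
    for (i = 0; i < num_blocks; i++) {
        if (abs(cur_dc[i] - prev_dc[i]) > threshold)
            sdd++;
    }
    return sdd;
}
```

A large result relative to the number of blocks suggests that the two I-frames belong to different shots.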

In the validation mode, verification is used after a first pass in blockinfo mode to
validate the cuts, when there are ambiguities. The decision whether to perform validation
depends on a fixed threshold of skipped macroblocks in the frame. If the number of
skipped macroblocks is higher than this threshold, the number of significant macroblocks
is found to be insufficient and the validation is performed. For each ambiguous cut, an
algorithm finds the first previous I-frame and the first next I-frame. Then the dct mode
is performed for these two frames only. MERIT verifies the presence of a cut and decides
whether to keep or reject this cut.

The full mode is similar to the validation mode, except that the dct validation is
performed for all the cuts, even if there is no ambiguity.
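
The difference between the validation and full modes comes down to when the DCT check is triggered, which the following C sketch makes explicit. The `SegMode` enumeration, the function `needs_dct_validation()` and the `skip_threshold` parameter are hypothetical names introduced for illustration; they do not appear in the MERIT sources described here.

```c
#include <assert.h>

/* Hypothetical names for the four segmentation modes. */
typedef enum { MODE_BLOCKINFO, MODE_DCT, MODE_VALID, MODE_FULL } SegMode;

/* Decide whether a cut detected in the blockinfo pass should be
 * re-checked with the DCT comparison of the surrounding I-frames.
 * Returns 1 if validation should run, 0 otherwise. */
int needs_dct_validation(SegMode mode, int num_skipped, int skip_threshold)
{
    assert(num_skipped >= 0);
    switch (mode) {
    case MODE_FULL:
        return 1;                             /* every cut is validated */
    case MODE_VALID:
        return num_skipped > skip_threshold;  /* only ambiguous cuts */
    default:
        return 0;                             /* no second pass */
    }
}
```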

2.3.4  Other  tools 

MERIT also provides other tools, described by V. Kobla as secondary options for
performing specific minor tasks [2].

The -dispMV and -dispFlow options pop up a window displaying various information
about the macroblocks of each frame. -dispMV shows the macroblock types and the
motion vectors, and -dispFlow shows the flow vectors. The flow vectors can be
computed and stored without display, using the option -flow.

Motion analysis can be turned on with the -motion option. This subshot analysis is
based on the extracted camera motion information.

Estimation of the DC coefficients of P and B frames is possible with -estimateDC,
in different color spaces.

The -saveppms and -showKey options allow the user to store and display the key-frames.
While -saveppms only stores the key-frames in a ppm subdirectory,
-showKey creates and displays a montage of the key-frames, using ImageMagick
utilities with the stored key-frames.

Content subshot analysis can be performed with -contentAll and -contentMotion.
-contentAll uses the flow of all frames whereas -contentMotion uses the flow of
frames in consistent camera motion sequences only.

3  mpeg_stat  and  mpeg2stat 

According to what we have seen in earlier sections, in order to make MERIT compatible
with MPEG-2, we mainly need to upgrade the parsing component of MERIT. That is,
we need to understand how mpeg_stat operates, and how and when the current Line
structure is updated. A new version of mpeg_stat, mpeg2stat, already exists and is able
to analyze MPEG-2 files. We have to modify mpeg2stat in the same way as mpeg_stat
has been modified for MERIT. Thus, an mpeg2stat() function in MERIT must store
information in the same structures as mpeg_stat() did. It is not necessary to touch the
video segmentation part. Therefore, we now focus on describing the parsing program
mpeg_stat, its adaptation to MERIT, and the latest version of mpeg_stat, mpeg2stat.

3.1  mpeg_stat 

mpeg_stat is used to provide statistics from an MPEG-1 file. Therefore, it parses the
MPEG bit stream, and collects information. In its original version, the output can be
stored in specific files. Some options are available, listed below.

-quiet:            Turn off output of frame types/matrices as encountered
-verify:           Do more work to help assure the validity of the stream
-start N:          Begin collection at frame N (first frame is 1)
-end N:            End collection at frame N (end >= start)
-histogram file:   Put detailed histograms into file
-qscale file:      Put qscale information into file
-size file:        Write individual frame type and size into file
-offsets file:     Write high-level header offsets into file
-block_info file:  Put macroblock usage into file
-dct:              Put decoded DCT information into block file
-rate file:        Put instantaneous rate information in file
-ratelength N:     Measure bit rate per N frames, not one second's worth
-syslog file:      Store parsing of system layer into file
-userdata file:    Store user data information into file
-time:             Measure time to decode frames
-all file:         Put all information into files with basename file


When mpeg_stat is run without any options and just with an MPEG filename, it
simply parses the stream and outputs information to stdout.

The main() function is located in the main.c file. This function only processes
the options and calls another function for the parsing: mpegVidRsrc(), located in the
video.c file. This file seems to be the main file for the parsing. It contains several useful
functions such as ParseGOP() and ParsePicture().

3.1.1  Storing  structures 

While reading the bit stream, mpeg_stat puts the information into a specific structure
similar to the Line structure in MERIT. This structure is called VidStream, and is defined
in video.h. Here is a description of this video stream structure, with some explanations.
The GoP (Group of Pictures), Pict (Picture), Slice, Macroblock and Block structures
are defined in video.h as well.

/* Video stream structure. */
typedef struct vid_stream
{
    unsigned int h_size;                            /* Horiz. size in pixels. */
    unsigned int v_size;                            /* Vert. size in pixels. */
    unsigned int mb_height;                         /* Vert. size in mblocks. */
    unsigned int mb_width;                          /* Horiz. size in mblocks. */
    unsigned char aspect_ratio;                     /* Code for aspect ratio. */
    unsigned char orig_picture_rate;                /* Code for picture rate. */
    unsigned char picture_rate;                     /* A valid picture rate. */
    unsigned int bit_rate;                          /* Bit rate. */
    unsigned int vbv_buffer_size;                   /* Minimum buffer size. */
    BOOLEAN const_param_flag;                       /* Constrained parameter flag. */
    unsigned char intra_quant_matrix[8][8];         /* Quant. matrix for intracoded frames. */
    unsigned char non_intra_quant_matrix[8][8];     /* Quant. matrix for non-intracoded frames. */
    char *ext_data;                                 /* Extension data. */
    char *user_data;                                /* User data. */
    int ext_size;                                   /* Length of Extension data */
    int user_size;                                  /* Length of User data */
    GoP group;                                      /* Current group of pictures */
    Pict picture;                                   /* Current picture. */
    Slice slice;                                    /* Current slice. */
    Macroblock mblock;                              /* Current macroblock. */
    Block block;                                    /* Current block. */
    int state;                                      /* State of decoding. */
    int bit_offset;                                 /* Bit offset in stream. */
    unsigned int *buffer;                           /* Pointer to next byte in buffer. */
    int buf_length;                                 /* Length of remaining buffer. */
    unsigned int *buf_start;                        /* Pointer to buffer start. */
    int max_buf_length;                             /* Max length of buffer. */
    PictImage *past;                                /* Past predictive frame. */
    PictImage *future;                              /* Future predictive frame. */
    PictImage *current;                             /* Current frame. */
    PictImage *ring[RING_BUF_SIZE];                 /* Ring buffer of frames. */
} VidStream;


As we can see, like the Line structure in MERIT, this VidStream structure follows
the MPEG specifications about the semantics of the video stream encoding. The video
stream is divided into Groups of Pictures, Pictures, Slices, Macroblocks and Blocks.


3.1.2  Parsing  techniques 

The mpegVidRsrc() function looks at the first bits of the current bit stream. Then
it identifies one of the following codes: Sequence-start-code, Sequence-end-code,
GOP-start-code, Picture-start-code, Slice-start-code, Macroblock-start-code or Block-start-code.
Then it calls a specific function to parse the stream, according to the code found:
ParseSeqHead(), ParseGOP(), ParsePicture(), etc. This process is repeated until the
end of the bit stream is reached (or the last specified frame, if the -end option is used).
All the Parse“type”() functions are defined in video.c except for the block type parsing
function, which is defined in the parseblk.c file.

Typically, a Parse“type”() function parses the features of the type and fills in the
VidStream structure. At the same time, it writes some information into files if needed.
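
The dispatch on start codes can be illustrated with a small C sketch. It is not the mpeg_stat code: `StartCodeKind` and `classify_start_code()` are hypothetical names. The numeric values are those of the MPEG video syntax (a 0x000001 prefix followed by one identifying byte), and the sketch covers only the codes from the sequence level down to the slice level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of the start codes handled by the parser. */
typedef enum {
    SC_PICTURE, SC_SLICE, SC_USER_DATA, SC_SEQ_HEADER,
    SC_EXTENSION, SC_SEQ_END, SC_GOP, SC_OTHER
} StartCodeKind;

/* Classify the 4-byte start code at p (prefix 00 00 01, then an id byte). */
StartCodeKind classify_start_code(const unsigned char *p)
{
    assert(p != NULL);
    if (p[0] != 0x00 || p[1] != 0x00 || p[2] != 0x01)
        return SC_OTHER;                 /* no start-code prefix here */

    if (p[3] == 0x00) return SC_PICTURE;
    if (p[3] >= 0x01 && p[3] <= 0xAF) return SC_SLICE;
    if (p[3] == 0xB2) return SC_USER_DATA;
    if (p[3] == 0xB3) return SC_SEQ_HEADER;
    if (p[3] == 0xB5) return SC_EXTENSION;
    if (p[3] == 0xB7) return SC_SEQ_END;
    if (p[3] == 0xB8) return SC_GOP;
    return SC_OTHER;                     /* system-layer or reserved code */
}
```

A parser loop would call such a classifier after aligning to the next byte boundary and then hand control to the matching Parse“type”() routine.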

3.2  Adaptation  of  mpeg_stat  for  MERIT 

When the mpeg_stat program was incorporated into MERIT, some modifications were
needed. We now discuss the relevant modifications that were required by MERIT.

3.2.1  mpeg_stat() 

The main() function of mpeg_stat has been renamed mpeg_stat() in MERIT. While
main() does the processing of the options, mpeg_stat() in MERIT only requires the MPEG
filename and a dct flag. Thus, the modifications are only an adaptation of the option
processing. Then, like the main() function in mpeg_stat, mpeg_stat() calls mpegVidRsrc().
The MERIT file main.c includes the new files

#include "Global.h"
#include "List.h"

where it defines the extern Line pointers head and current. At the beginning and end
of mpeg_stat(), these pointers are freed and filled with the NULL constant.

mpegVidRsrc() is located in video.c. This file seems to be the one with the most
important modifications. First, there are declarations of external variables and some
necessary internal variables or functions. Then, mpegVidRsrc() and its Parse“type”()
subfunctions have been modified to fill the current Line while parsing. This filling will
now be referred to as the InfoList building algorithm. It involves putting the bit stream
information in the current Line and transferring the information from current to Info
and InfoList.

3.2.2  The  InfoList  building  algorithm 

Here  is  a  description  of  the  InfoList  building  algorithm  which  will  be  useful  for  the  upgrade 
to  MPEG-2  processing.  The  square  brackets  contain  the  name  of  the  function  where  the 
action  is  performed,  and  its  location. 

/* extern variables */
extern int totalFrames
extern int numIFrames
extern int maxMV
extern int mpeg_mode
extern int pass1TotalFrames
extern Info **info
extern Line *current
int h_size, v_size, mb_width, mb_height;

/* extern functions */
extern void add_node()    /* Actually no use is made of this function */
extern void refresh_node()
extern int parseDctSelectiveContd(Info **info, int totalframes)
extern void initInfoBlkMode(Line *current)
extern void initInfoDctMode(Line *current)
extern void initInfoDCTValidMode(Line *current)

/* useful intern variables */
MV *forw, *back
int mbtype


[mpegVidRsrc(), video.c]
while not (end of file)
    read next_start_code
start:
    switch start_code

    case sequence_end_code:
        exit

    case sequence_start_code:
        [ParseSeqHead(), video.c]
        parse sequence header
        get horizontal size of image space (h_size)
        get vertical size of image space (v_size)
        calculate macroblock horizontal size of image (mb_width)
        calculate macroblock vertical size of image (mb_height)
        read next start code
        goto start

    case group_of_picture_code:
        [ParseGOP(), video.c]
        parse GOP header
        refresh_node()
        current->LineType=GOP
        initInfo()
        read next start code
        goto start

    case picture_start_code:
        [ParsePicture(), video.c]
        parse picture header
        refresh_node()
        current->LineType=FRAME
        current->frameNo='current frame #'
        current->frameType='current frame type' (OIPB)
        totalFrames++
        if frameType=I numIFrames++
        initInfo()
        read next start code
        goto start

    case slice_start_code:
        [ParseSlice(), video.c]
        parse slice header
        refresh_node()
        current->LineType=SLICE
        initInfo()
        read next start code
        goto start


default : 

[ParseMacroBlockO  ,  video,  c] 


if  (Previous  Macroblocks  skipped) 

{ 

[ProcessSkippedPFrameMBlocks () ,  video . c] 

[and  ProcessSkippedBFrameMBlocks 0 ,  video. c] 
for  each  skipped  MB,  do 
{ 

ref  reshjiode  0 
current ->LineType=BLOCK 

current->qscale= 'block  quantization  scale' 

current ->numBits=0 

current ->blockNo= ' current  block  #' 

current ->blockType=SKIP 

initinf o() 

} 

} 


mbtype= ' current  MB  type'  (F0RW,BACK, INTRA,BI ,  0) 
switch  mbtype 

case  FORW  : 
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forw->x= ' current  right  forw.  mv' 
forw->y= ' current  down  forw.  mv' 
update  maxMV  (all  directions,  if  necessary) 
case  BACK  : 

back->x= ' current  right  backw.  mv' 
back->y= ' current  down  backw.  mv' 
update  maxMV  (all  directions,  if  necessary) 
case  BI  : 

forw->x= ' current  right  forw.  mv' 
forw->y= ' current  down  forw.  mv' 
back->x= ' current  right  backw.  mv' 
back->y= ' current  down  backw.  mv' 
update  maxMV  (all  directions,  if  necessary) 
default  : 

ref  reshjiode  0 

current ->LineType=BLOCK 

current ->blockNo= ' current  block  #’ 

current ->blockType=mbtype 

current->qscale= ' current  quantization  scale' 
current ->numBits= 'number  of  bits  for  the  current  MB' 
current->f orw=f orw 
current ->back=back 
if  (DCT  info  needed) 

{ 

current->dctSpecs  =  'current  DCT  Collection  String' 

} 

initinf o  0 

read  next  start  code 
goto  start 


Since  the  structure  of  mpeg2stat  is  nearly  the  same  as  the  structure  of  mpeg_stat, 
this  algorithm  will  be  used  in  the  adaptation  of  mpeg2stat  to  MERIT.  Next,  we  have  to 
understand  how  mpeg2stat  works. 
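
The dispatch on start codes at the top of this algorithm can be sketched in C as
follows. The start-code values are those of the MPEG video bit stream syntax; the
names UnitKind and classify_start_code are illustrative assumptions and are not taken
from the mpeg_stat, mpeg2stat or MERIT sources.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; not taken from the MERIT sources. */
typedef enum {
    UNIT_PICTURE, UNIT_SLICE, UNIT_SEQ_HEADER,
    UNIT_SEQ_END, UNIT_GOP, UNIT_OTHER
} UnitKind;

/* Map a 32-bit MPEG video start code (0x000001xx) to the level it opens. */
UnitKind classify_start_code(uint32_t code)
{
    assert((code >> 8) == 0x000001u);      /* must carry the start-code prefix */
    uint32_t id = code & 0xFFu;

    if (id == 0x00) return UNIT_PICTURE;               /* picture_start_code   */
    if (id >= 0x01 && id <= 0xAF) return UNIT_SLICE;   /* slice_start_code     */
    if (id == 0xB3) return UNIT_SEQ_HEADER;            /* sequence_header_code */
    if (id == 0xB7) return UNIT_SEQ_END;               /* sequence_end_code    */
    if (id == 0xB8) return UNIT_GOP;                   /* group_start_code     */
    return UNIT_OTHER;                     /* user data, extensions, ...       */
}
```

A parser loop would call classify_start_code() on each start code it finds and branch
to the corresponding header parser, as the switch in the algorithm above does.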

3.3  mpeg2stat 

mpeg2stat is a program written by Ian Gordon (©1998), originally intended to
collect and output statistics from MPEG-1 and MPEG-2 files. It has been adapted
from the MPEG Software Simulation Group's decoder mpeg2decode (©MSSG 1996).
It can use the information from a base layer file and from an enhancement layer file
(scalability).

The statistics given by mpeg2stat are basically the same as in mpeg_stat. Moreover,
depending on the verbose level (-v option), mpeg2stat displays MPEG-2 specific
information at different levels (Sequence, GOP, Picture, Slice, Macroblock). With the trace
option (-t) the AC coefficients of each block are given.

There  are  no  storing  structures  in  mpeg2stat,  since  the  decoded  information  is  output 
on  the  fly,  while  parsing  the  bit  stream.  Again,  mpeg2stat  follows  the  MPEG  specification 
structure,  giving  the  information  at  different  specified  levels. 

3.3.1  Parsing  techniques 

The main() function is located in mpeg2dec.c, and only processes the options. Then,
in Decode_Bitstream() [mpeg2dec.c], a first call to Header() [mpeg2dec.c] finds out
whether there is a video stream. Header() just calls another function, Get_Hdr(), located
in gethdr.c. This function recognizes the header type (sequence_start, GOP, picture,
slice, sequence_end).

After the first call to Header(), video_sequence() [mpeg2dec.c] is the function where
the routine begins. Header() is called repeatedly until a picture header is found; then
Decode_picture() [getpic.c] is called, followed by Header() again. This routine ends when
Header() returns an end-of-sequence message. Decode_picture() basically calls several
functions which collect all the information about each macroblock in the picture,
including macroblock type, motion vectors [getvlc.c], and DCT coefficients of each block
[getblk.c]. The architecture of this program is quite similar to that of mpeg_stat.
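
Header recognition of this kind rests on locating the byte-aligned start-code prefix
00 00 01 in the stream. The following is a minimal sketch of that search over an
in-memory buffer; the function name next_start_code and the buffer-based interface are
assumptions made for illustration, and mpeg2stat itself reads the stream through its
own bit-level routines.

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset of the next 00 00 01 prefix at or after `pos` that is
 * followed by at least one code byte, or -1 if there is none. */
long next_start_code(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL);
    for (size_t i = pos; i + 3 < len; i++) {
        if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01)
            return (long)i;
    }
    return -1;
}
```

The byte at offset + 3 identifies the header type, so a caller would read it and then
resume the search from offset + 4.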

3.3.2  Modifications  of  mpeg2stat 

To prepare mpeg2stat for MERIT, it was necessary to debug it and to add new options.
The debugging consisted mainly of adding some outputs forgotten by the author. In
particular, there was originally no output of DCT coefficients for MPEG-1 files, and no
output of the DC coefficients used by MERIT. Furthermore, the way DCT coefficients are
collected was changed. In the original program, the DCT coefficients were output on the
fly while parsing the stream. In the new version, the DCT coefficients are collected in a
string and output at the end of each macroblock parsing. This string, the dctSpecifics
variable, is useful for the upgrade of MERIT, because it was produced by mpeg_stat and
was used by the segmentation algorithm. After these changes mpeg2stat stores the DCT
information the same way mpeg_stat does.

Finally, -start and -end options were added, which allow the user to collect statistics
only for a specified range of frames. These two options are necessary for the validation
option in MERIT.
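
Collecting coefficients in a string rather than printing them immediately can be
sketched as below. This is only an illustration: the helper name append_coeff and the
space-separated decimal format are assumptions, since the actual layout of the
dctSpecifics string is not described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Append one coefficient to a per-macroblock string buffer.
 * `len` is the current string length; returns the new length,
 * or -1 (leaving the string unchanged) if the value does not fit. */
int append_coeff(char *buf, size_t cap, size_t len, int coeff)
{
    assert(buf != NULL && len < cap);
    int n = snprintf(buf + len, cap - len, "%d ", coeff);
    if (n < 0 || (size_t)n >= cap - len) {
        buf[len] = '\0';               /* roll back a truncated write */
        return -1;
    }
    return (int)(len + (size_t)n);
}
```

The buffer would be reset at the start of each macroblock and handed over once the
macroblock has been parsed.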

3.3.3  Integration 

Having created an upgraded version of mpeg2stat, we now have to integrate it into
MERIT, just as mpeg_stat was. That is, the main() function needs to be modified, and
the same InfoList building algorithm must be included in mpeg2stat.

4  MERIT 4.0


The first version of MERIT 4.0 was an alpha version whose purpose was to verify the
possibility of integrating with mpeg2stat. We first focus on this version. Then some relevant
MPEG-2 specifications are discussed, and the final version is outlined in more detail.

4.1  MERIT  4.0  alpha 

The main idea of the alpha version is that a correct usage of the InfoList building
algorithm would allow MERIT to perform its segmentation algorithm. This version indirectly
uses mpeg2stat, insofar as it needs a parsing pass by mpeg2stat to produce an info.t file.
This file is basically a copy of the standard output of mpeg2stat with the -t trace option,
obtained by redirection.

Example: mpeg2stat -t filename.mpg > info.t

info.t contains all the picture, macroblock and block information needed to perform
block-based segmentation. For MERIT 4.0 alpha, all the functions dedicated to
segmentation were kept and separated from mpeg_stat. A main.c file containing a new
mpeg_stat() function was added to parse the info.t file and collect the needed information.
While parsing, information is sent to the segmentation part of MERIT, in accord with
the InfoList building algorithm.

Example:  MERIT  info.t 

This version sends only the macroblock type information, omitting the DCT
segmentation. The results of the macroblock type algorithm are satisfactory when tested with
MPEG-1 files or MPEG-2 files coded with frame pictures. This first series of tests made
it apparent that MPEG-2 specific picture structures such as field pictures, and MPEG-2
specific motion compensation modes such as field prediction or Dual Prime mode, would
require preliminary computations before the original MERIT segmentation algorithm
could be used. Another problem was the relatively small number of MPEG-2 video files
available on the Internet, especially when looking for uncommon picture structures or
motion compensation modes.

4.2  MPEG-2  specificity  issues 

As explained in Appendix A, MPEG-2 is a superset of MPEG-1. To be more precise, an
MPEG-2 decoder is expected to be able to read MPEG-1 files. But an MPEG-2 encoder
cannot write an MPEG-1 file, because of the MPEG-2 specifications. At best an MPEG-2
encoder can write an MPEG-2 file with the same characteristics as an MPEG-1 file
(frame pictures, frame prediction, frame dct and progressive sequence). However, since
MERIT has been written for MPEG-1 files, it cannot deal with MPEG-2 specifications.
We chose to keep the segmentation part of MERIT as it was, for several reasons. First,
this algorithm was written by others and it was risky to make changes to it. Second,
an adaptation of the segmentation algorithm of MERIT could change the results for
MPEG-1 files, which is not expected in an upgrade. This is the reason why mpeg2stat
was used to parse the MPEG-2 video stream, and send information to MERIT in the
same form that it would expect from an MPEG-1 file. With this solution, MPEG-1 files
are handled as in the previous version, and the integrity of MERIT's algorithm is not
endangered.

Having chosen this solution, it became necessary to find how to handle MPEG-2
specifications. Here are the solutions that were chosen.

•  For frame pictures with frame prediction and frame dct, the results are the same
for MPEG-1 and MPEG-2 files (MPEG-1 uses only frame pictures, frame motion
compensation and frame dct). For frame pictures with field prediction, MERIT
uses an average of the two motion vectors corresponding to the two fields.

•  Field dct gives the same results as frame dct, since MERIT uses only the DC
coefficient of each block, which does not depend on dct_type.

•  For field pictures, only the first field of a frame is used. This solution was adopted
to avoid 'inter-field' prediction problems. According to the MPEG-2 specifications,
a macroblock from the second field of a frame may be predicted using motion
vectors with respect to the first field of the same frame. Macroblocks from the
first field can be predicted only with respect to a field of another frame (i.e., not
the second field of the same frame). Since the segmentation algorithm of MERIT
is based on the macroblock inter-frame prediction type, the second field cannot
be used without time-wasting computations. Each macroblock from the first field
is counted twice to respect the total number of macroblocks. The two identical
macroblocks are naturally placed one above the other. Since the macroblocks are
coded in horizontal order first, the identical macroblock is stored in a buffer until the
end of the macroblock line is reached. Then the stored macroblocks are processed
and the buffer is emptied before the next macroblock line in the field is read.
Again, in the case of 16 × 8 MC prediction, an average of the two motion vectors
is computed and used as the unique motion vector of the macroblock.

•  When Dual Prime prediction is used (see Section A.5.5), the motion vector is the
coded vector. There is no use including the dm vector, since its length wouldn't
significantly change the motion vector.
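
The averaging of two motion vectors used for field prediction and for 16 × 8 MC can be
sketched as follows. The MV structure shown is a stand-in with the x and y members
used in the InfoList building algorithm, and both the function name average_mv and the
rounding rule (C integer division, which truncates toward zero) are assumptions; the
report does not state how MERIT 4.0 rounds.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for MERIT's MV type; only the x and y members are assumed. */
typedef struct { int x; int y; } MV;

/* Merge the two vectors of a macroblock into a single representative one. */
MV average_mv(const MV *first, const MV *second)
{
    assert(first != NULL && second != NULL);
    MV avg;
    avg.x = (first->x + second->x) / 2;   /* truncates toward zero */
    avg.y = (first->y + second->y) / 2;
    return avg;
}
```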

All these approximations of MPEG-2 files imply a loss of precision. But at the same
time, the information seems to be sufficient according to the results of several series of
tests, and using less information increases processing speed. The next section describes
how these adaptations are implemented in MERIT 4.0.

4.3  MERIT  4.0  implementation 

In this final version, all files from the segmentation part of MERIT (filenames beginning
with a capital letter) were kept, and the mpeg2stat files (filenames beginning with a
lower-case letter) were added. The mpeg2stat files were taken from the modified version
of mpeg2stat. None of the files of the segmentation component of MERIT were modified,
in accordance with the decisions described above. Of course, the mpeg2stat files were
modified. First of all, all Trace and Verbose outputs were commented out. There is
no test to see whether Trace_Flag or Verbose_Flag is on or off, to increase processing
speed. Moreover, all parts involving an enhancement layer were commented out, since
MERIT uses only the base layer file. The following files were modified:

•  global.h

Global.h and List.h were included from MERIT, as well as all needed external
variables and functions (mainly used in the InfoList building algorithm).

•  mpeg2dec.c

The main() function was renamed mpeg_stat() (not mpeg2stat(), so there was
no need to change the name when calling in files from MERIT). There is a short
options setup. The InfoList building starts here with FRAME Line update, and
a call to initInfo(). Obsolete functions such as Process_options(), Usage(),
Print_options() and Output_Statistics() have been removed.

•  gethdr.c

InfoList building for SLICE Line and GOP Line. The lines required by this
algorithm were added, including reset of the current pointer, update of the pointed
Line structure with the parsed information, and calls to the initInfo() functions.

•  getpic.c

InfoList building for BLOCK Line as explained for SLICE and GOP, with some needed
computations such as motion vector(s). Moreover, when dealing with a field picture,
a Line buffer is built, so that one macroblock can be sent twice to MERIT via the
InfoList.

•  getblk.c

Test for dct flag. If dct = 1, DCT coefficients are stored in a dctSpecifics string.
The coefficients are stored in this string on the fly while parsing the stream. This
test improves processing speed when blockinfo mpeg_mode is used.

The other files from mpeg2stat have not been modified and are used as in mpeg2stat.
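
The row buffering used for field pictures in getpic.c can be illustrated on a plain grid
of macroblock types. This is a simplified sketch under stated assumptions: MERIT 4.0
buffers Line structures and feeds them to the InfoList, whereas the hypothetical
function expand_first_field below works on an array of int values that has already
been collected for the first field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Expand a first-field grid (field_rows x mb_width) into a frame grid
 * (2 * field_rows x mb_width) by emitting every macroblock row twice,
 * so that the two identical macroblocks end up one above the other. */
void expand_first_field(const int *field, int field_rows, int mb_width,
                        int *frame)
{
    assert(field != NULL && frame != NULL);
    assert(field_rows >= 0 && mb_width > 0);
    for (int r = 0; r < field_rows; r++) {
        const int *row = field + (size_t)r * (size_t)mb_width; /* buffered row */
        for (int copy = 0; copy < 2; copy++) {
            int *dst = frame + (size_t)(2 * r + copy) * (size_t)mb_width;
            memcpy(dst, row, (size_t)mb_width * sizeof *row);
        }
    }
}
```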

4.4  Tests  and  further  possible  improvements 

This version was tested with several files. The results were as follows:

•  For MPEG-1 files, the results are exactly the same, for all possible options. This
was predictable, since only the parsing part was modified. The only difference is
the processing speed, which is noticeably lower with this new version. This is probably
due to the increased number of tests during the parsing process.

•  For MPEG-2 files using frame pictures (frame or field prediction, frame or field
dct) the results are similar to those for MPEG-1 files. All the cuts are found
for unambiguous clips (as in MPEG-1). The DCT based segmentation algorithm
gives odd results, but this already occurred with MERIT 3.3. The -showKey and
similar options do not work with MPEG-2 files, because these options directly use
mpeg_play, which doesn't support MPEG-2 files. The options used by mpeg_play
in this case are not provided by mpeg2play.

•  For MPEG-2 files with field pictures, the results are good as well. Of course,
with the -dispMV option, two consecutive lines of macroblocks are identical, and
the displayed macroblock types do not show the reality of the encoded picture.
But this solution seems to work. The segmentation is actually performed on field
pictures with half the height of displayed frame pictures.

All the test files were in 4:2:0 chroma format. MERIT 4.0 has not yet been tested
with other formats, but the block-based segmentation should also work for those. For
the DCT segmentation, MERIT currently stops the parsing of the dctSpecifics string
after the 6th block, so that only the first two chrominance blocks are taken into account.

It is worth noting that all the tests were performed with reconstructed MPEG-2
files. That is, it was necessary to decode MPEG-1 files and encode them as MPEG-2
files, because MPEG-2 video sequences with cuts could not be found on the Internet.
However, the MSSG encoder mpeg2encode was used, and it was possible to construct
almost all types of MPEG-2 files. For interlaced video, interlaced pictures decoded from
an MPEG-2 video clip without cuts were used, and they were reordered to provide some
cuts in the clip before encoding. It was not possible to test Dual Prime predicted MBs
nor clips with both field- and frame-pictures in the same stream, because mpeg2encode
does not provide for these possibilities.
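
The six-block limit corresponds to the 4:2:0 macroblock layout. As a sketch of what a
format-aware limit would look like, the helper below returns the number of blocks per
macroblock for each MPEG-2 chroma_format code (1 for 4:2:0, 2 for 4:2:2, 3 for 4:4:4).
The function names are illustrative and are not part of MERIT.

```c
#include <assert.h>

/* chroma_format codes of the MPEG-2 sequence extension */
enum { CHROMA_420 = 1, CHROMA_422 = 2, CHROMA_444 = 3 };

/* Total number of 8x8 blocks in one macroblock, or -1 for an unknown code. */
int blocks_per_macroblock(int chroma_format)
{
    switch (chroma_format) {
    case CHROMA_420: return 6;    /* 4 luminance + 2 chrominance */
    case CHROMA_422: return 8;    /* 4 luminance + 4 chrominance */
    case CHROMA_444: return 12;   /* 4 luminance + 8 chrominance */
    default:         return -1;
    }
}

/* Number of chrominance blocks: everything beyond the 4 luminance blocks. */
int chroma_blocks(int chroma_format)
{
    int total = blocks_per_macroblock(chroma_format);
    assert(total > 0);
    return total - 4;
}
```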

Here  is  a  non-exhaustive  list  of  possible  improvements: 

•  A  solution  should  be  found  for  the  -showKey  option.  Possibly  more  options  should 
be  added  to  mpeg2play. 

•  For sequences with field-pictures, an efficient use of the second field of a frame could
be added, at least to provide a correct output when using the -dispMV option.

5  Conclusion 

The new version of MERIT is now able to analyze MPEG-2 files. The modifications were
made in accordance with the MPEG-2 specifications and using the original segmentation
algorithm. MPEG-1 files are analyzed exactly as in the former version, and in case of
MPEG-2 files, the information is processed before being analyzed by the segmentation
algorithm. This approach gives good results, according to our tests.

A  Appendix: The MPEG-1 and MPEG-2 video standards

A.1  Introduction

The Moving Picture Experts Group (MPEG) [12] is a working group of ISO/IEC in
charge of the development of international standards for compression, decompression,
processing, and code representation of moving pictures, audio and their combination.
The MPEG-1 standard was approved in Nov. 1992, and MPEG-2 in Nov. 1994.

A.2  MPEG video coder source model

The MPEG digital video coding techniques are statistical in nature [7]. Video sequences
usually contain statistical redundancies in both temporal and spatial dimensions. The
basic statistical property upon which MPEG compression techniques rely is inter-pixel
correlation, including the assumption of simple correlated translational motion between
consecutive frames. Thus, it is assumed that the magnitude of a particular image pixel
can be predicted from nearby pixels within the same frame (using Intra-frame coding
techniques) or from pixels of a nearby frame (using Inter-frame techniques). The MPEG
compression algorithms employ Discrete Cosine Transform (DCT) coding techniques on
image blocks of 8 × 8 pixels to efficiently exploit spatial correlations between nearby pixels
within the same image.

However, if the correlation between pixels in nearby frames is high, i.e. in cases where
two consecutive frames have similar or identical content, it is desirable to use Inter-frame
DPCM coding techniques employing temporal prediction (motion-compensated
prediction between frames). In MPEG video coding schemes an adaptive combination of
temporal motion-compensated prediction followed by transform coding of the remaining
spatial information is used to achieve high data compression (hybrid DPCM/DCT coding
of video).

A.3  Compression techniques

A.3.1  Subsampling and interpolation

The basic concept of subsampling is to reduce the dimension of the input video (horizontal
dimension and/or vertical dimension) and thus the number of pixels to be coded prior
to the encoding process. At the receiver the decoded images are interpolated for display.
Since the human eye is more sensitive to changes in brightness than to chromaticity
changes, the MPEG coding schemes first divide the images into YUV components (one
luminance and two chrominance components); then, the chrominance components are
subsampled relative to the luminance component with a Y:U:V ratio specific to particular
applications (with the MPEG-2 standard, a ratio of 4:1:1, 4:2:2, or 4:4:4 is used).

A.3.2  Transform domain coding

The purpose of transform coding is to de-correlate the image content and to encode
transform coefficients rather than the original pixels of the images. To this end the input
images are split into disjoint blocks of pixels. Of many possible alternatives, the Discrete
Cosine Transform (DCT) applied to small image blocks, usually 8 × 8 pixels, has become
the most successful transform for still image and video coding. A major objective of
transform coding is to make as many transform coefficients as possible small enough so
that they are insignificant and need not be coded for transmission.

The DCT coefficients are put into a matrix, called the DCT matrix. On average only
a small number of DCT coefficients need to be transmitted to the receiver to obtain a
useful approximate reconstruction of the image blocks. Moreover, the most significant
DCT coefficients are concentrated around the upper left corner of the matrix (low
DCT coefficients) and the significance of the coefficients decays with increased distance.
Since the human viewer is more sensitive to reconstruction errors related to low spatial
frequencies than to high frequencies, a frequency-adaptive weighting (quantization) of
the coefficients according to human visual perception (perceptual quantization) is often
employed to improve the visual quality of the decoded images for a given bit rate.
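
The coefficient in the upper left corner, the DC coefficient, is the one MERIT relies on,
and it can be computed without evaluating any cosine. With the usual 8 × 8 DCT
definition (scale factor 1/4, with C(0) = 1/sqrt(2)), the DC term is the sum of the 64
samples divided by 8, i.e. 8 times the block average. The sketch below assumes 8-bit
samples stored in raster order; the function name block_dc is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* DC coefficient of an 8x8 block of 8-bit samples in raster order:
 * F(0,0) = (1/4) * C(0) * C(0) * sum = sum / 8, with C(0) = 1/sqrt(2). */
double block_dc(const unsigned char *block)
{
    assert(block != NULL);
    int sum = 0;
    for (int i = 0; i < 64; i++)
        sum += block[i];
    return sum / 8.0;
}
```

For a uniform block of value 100 this gives 800, which is 8 times the average.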

A.3.3  Motion compensated prediction

The concept of motion compensation is based on the estimation of motion between video
frames, i.e. if all elements in a video scene are spatially displaced, the motion between
frames can be described by a motion vector. To this end images are usually separated into
disjoint blocks of pixels (16 × 16 pixels in the MPEG-1 and MPEG-2 standards) and only
one motion vector is estimated, coded and transmitted for each of these blocks. In the
MPEG compression algorithms the motion-compensated prediction techniques are used
for reducing temporal redundancies between frames and only the prediction error images
(the differences between the original images and the motion compensated prediction
images) are encoded, using the DCT technique.
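
The prediction error for one block can be sketched as follows. This is a simplified
illustration restricted to whole-pixel displacements and a single luminance plane (the
standards also allow half-pixel vectors); the function name and parameter layout are
assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* residual = current block - reference block displaced by (mvx, mvy).
 * Pictures are width x height, one byte per pixel, raster order.
 * The block has its top-left corner at (bx, by) and is size x size pixels. */
void prediction_error(const unsigned char *cur, const unsigned char *ref,
                      int width, int height, int bx, int by,
                      int mvx, int mvy, int size, int *residual)
{
    assert(cur != NULL && ref != NULL && residual != NULL);
    /* Target block and displaced reference block must lie inside the picture. */
    assert(bx >= 0 && by >= 0 && bx + size <= width && by + size <= height);
    assert(bx + mvx >= 0 && by + mvy >= 0 &&
           bx + mvx + size <= width && by + mvy + size <= height);

    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++) {
            int c = cur[(by + y) * width + (bx + x)];
            int p = ref[(by + mvy + y) * width + (bx + mvx + x)];
            residual[y * size + x] = c - p;
        }
    }
}
```

When the vector matches the true displacement the residual is zero everywhere, which
is why only the prediction error needs to be transform coded.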

A.4  The MPEG-1 standard

The MPEG-1 standard is formally referred to as ISO 11172 and consists of several parts
(1. System, 2. Video, 3. Audio, 4. Conformance, 5. Software). We will focus on the
video part 11172-2. The MPEG-1 Video standard was originally aimed at coding video
of SIF resolution (352 x 240 at 30 noninterlaced frames/s or 352 x 288 at 25
noninterlaced frames/s) at bit rates of about 1.5 Mbits/s, for applications such as CD-i (compact
disc interactive). However, it also allows much larger picture sizes and correspondingly
higher bit rates. This standard specifies the video bit stream syntax and the
corresponding video decoding process. The basic MPEG-1 video compression technique is based
on a macroblock structure, motion compensation, and the conditional replenishment of
macroblocks.

A.4.1  Macroblocks and I-frames

The MPEG-1 coding algorithm encodes the first frame of a video sequence in Intra-frame
coding mode (I-picture), by using block-based DCT coding of 8 × 8 pixel blocks,
followed by quantization. The DC coefficient (average value and first coefficient) is
quantized with a uniform midstep quantizer with stepsize as specified by the
parameter intra_dc_precision = 3 - log2(stepsize). AC coefficients (the other 63 coefficients)
are quantized with a uniform midstep quantizer having a stepsize under control of the
parameter quantizer_scale. A high stepsize decreases the number of bits needed to
transmit the information, but also decreases the image quality. Each color input frame
in a video sequence is partitioned into non-overlapping macroblocks. Each macroblock
contains six 8 × 8 blocks of data from both luminance and co-sited chrominance bands:
four luminance blocks and two chrominance blocks, each of size 8 × 8 pixels (Fig. 1). Thus
the sampling ratio between Y:U:V luminance and chrominance pixels is 4:1:1 (sometimes
called 4:2:0). For an I-picture, the frame is partitioned into such macroblocks. Then
each luminance and chrominance block from each macroblock is coded using the DCT
technique.

Figure 1: Macroblock structure [6]. A macroblock comprises four luminance blocks (Y1,
Y2, Y3, Y4) and two chrominance blocks (Cr, Cb). Each block has a size of 8 × 8 pixels.
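
As a small worked example of this partitioning, the number of macroblocks per row and
per column follows from the picture size by rounding up to a whole number of 16-pixel
units. The helper name mb_count is illustrative; its results correspond to the mb_width
and mb_height values computed from h_size and v_size in Section 3.2.2.

```c
#include <assert.h>

/* Number of 16-pixel macroblocks needed to cover `size` pixels. */
int mb_count(int size)
{
    assert(size > 0);
    return (size + 15) / 16;
}
```

For the SIF resolution of 352 x 240 this gives 22 x 15 macroblocks, i.e. 330 macroblocks
per picture.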

A.4.2  Zig-zag scanning

The concept of zig-zag scanning of the coefficients is outlined in Fig. 2. The zig-zag scan
attempts to trace the DCT coefficients according to their significance, from the top-left
corner to the bottom-right corner. Only the non-zero quantized DCT coefficients are
encoded. The scanning of the quantized DCT-domain 2-dimensional signal followed by
variable-length code-word assignment (Huffman coding) for the coefficients serves as a
mapping of the 2-dimensional image signal into a 1-dimensional bit stream. The
non-zero AC coefficient quantizer values (length) are detected along the scan line as well as
the distance (run) between two consecutive non-zero coefficients. Each consecutive (run,
length) pair is encoded by transmitting only one codeword (Run Length Encoding).
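
The scan and the pairing can be sketched as below. The table is the standard zig-zag
order for an 8 × 8 block stored in raster order; the type and function names are
illustrative, and the variable-length coding of the resulting pairs is left out. The
second member of each pair, called "length" in the text above, is named level here.

```c
#include <assert.h>
#include <stddef.h>

/* zigzag[i] = raster index of the i-th coefficient in zig-zag scan order. */
static const int zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

typedef struct { int run; int level; } RunLevel;

/* Scan the 63 AC coefficients of a quantized block in zig-zag order and
 * write one (run, level) pair per non-zero coefficient, where run is the
 * number of zeros skipped since the previous non-zero coefficient.
 * `pairs` must have room for 63 entries. Returns the number of pairs. */
int run_level_pairs(const int *block, RunLevel *pairs)
{
    assert(block != NULL && pairs != NULL);
    int count = 0;
    int run = 0;
    for (int i = 1; i < 64; i++) {        /* position 0 is the DC term */
        int level = block[zigzag[i]];
        if (level == 0) {
            run++;
            continue;
        }
        pairs[count].run = run;
        pairs[count].level = level;
        count++;
        run = 0;
    }
    return count;
}
```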

A.4.3  P-frames and B-frames

Each subsequent frame is coded using Inter-frame prediction (predicted pictures, or
P-pictures): only data from the nearest previously coded I- or P-frame is used for
prediction. For coding P-pictures, the previous I- or P-picture frame N - 1 is stored in
a frame store in both encoder and decoder. Motion compensation is performed on a
macroblock basis: only one motion vector is estimated between frame N and frame
N - 1 for a particular macroblock to be encoded. These motion vectors are then coded.
The motion-compensated prediction error is calculated by subtracting each pixel in a
macroblock from its motion-shifted counterpart in the previous frame. An 8 × 8 DCT
is then applied to each of the 8 × 8 blocks contained in the error-image-macroblock,
followed by quantization of the DCT coefficients with subsequent zig-zag scanning and
run-length coding. The quantization stepsize can be adjusted for each macroblock in a
frame (Fig. 3). The advantage of coding video using motion compensation techniques
is the reduction of the residual signal to be coded compared to pure frame difference
coding.

Figure 2: Zig-zag scan of the DCT coefficients of an 8 × 8 block. [8]

To further explore the significant advantages of motion compensation and motion
interpolation, the concept of B-pictures (bidirectionally predicted pictures) was introduced
by MPEG-1. B-pictures can be coded using motion-compensated prediction based on
the two nearest already coded frames (either I-pictures or P-pictures). Since the coding
order of the pictures is not the same as the displaying order, B-pictures can use both
past and future frames as references (Fig. 4). The user can arrange the picture types in a
video sequence with a high degree of flexibility to suit diverse application requirements.
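
The difference between display order and coding order can be made concrete with a
short sketch. It assumes the picture types are given in display order as a string of 'I',
'P' and 'B' characters, and that each run of B-pictures is coded right after the reference
picture that follows it in display order; the function name to_coding_order is
illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* types[i] is 'I', 'P' or 'B' for the i-th picture in display order.
 * order[k] receives the display index of the k-th picture in coding order.
 * Returns the number of entries written (always n). */
int to_coding_order(const char *types, int n, int *order)
{
    assert(types != NULL && order != NULL && n >= 0);
    int out = 0;
    int first_b = 0;       /* start of the run of B-pictures not yet emitted */
    for (int i = 0; i < n; i++) {
        if (types[i] == 'B')
            continue;
        order[out++] = i;                   /* reference picture goes first  */
        for (int b = first_b; b < i; b++)
            order[out++] = b;               /* then the B-pictures before it */
        first_b = i + 1;
    }
    for (int b = first_b; b < n; b++)
        order[out++] = b;                   /* trailing B-pictures, if any   */
    return out;
}
```

For the display sequence I B B P B B P this yields the coding order 0, 3, 1, 2, 6, 4, 5.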

A.4.4  Conditional replenishment

An essential feature supported by MPEG-1 is the possibility of updating macroblock
information at the decoder only if needed (i.e. if the content of the macroblock has
changed in comparison to the content of the same macroblock in the previous frame).
This feature is called conditional replenishment. The key to efficient coding of video
sequences at low bit rates is the selection of appropriate prediction modes to achieve
conditional replenishment. The MPEG-1 standard distinguishes between three different
macroblock coding types (MB types):

Skipped MB: prediction from previous frame with zero motion vector. No information
about the macroblock is coded or transmitted to the receiver.

Inter MB: motion-compensated prediction from the previous frame is used. The MB
type, the MB address and, if required, the motion vector, the DCT coefficients and
quantization stepsize are transmitted.

Intra MB: no prediction is used from the previous frame (intra-frame prediction only).
Only the MB type, the MB address, the DCT coefficients and the quantization
stepsize are transmitted to the receiver.

Figure 3: Forward prediction of a macroblock in a P-picture [6]. DCT: Discrete Cosine
Transform, Quant.: Quantization, RLE: Run Length Encoding

Figure 4: Bidirectional prediction of a macroblock in a B-picture [6]. DCT: Discrete
Cosine Transform, Quant.: Quantization, RLE: Run Length Encoding

Figure 5: Video sequence structure as specified by MPEG [6]. A Group of Pictures
usually begins with an I-picture, followed by P- and B-pictures. Each picture is divided
into slices, macroblocks and blocks.

A.4.5 The MPEG-1 video stream syntax

Here is a coarse view of the video stream syntax. In typical MPEG-1 encoding, an
input video sequence is divided into units of groups of pictures (GOPs), where each GOP
consists of an arrangement of one I-picture, P-pictures, and B-pictures. A GOP serves
as a basic access unit. Each picture is divided further into one or more slices that offer
a mechanism for resynchronization and thus limit the propagation of errors. Each slice
is composed of a number of macroblocks. Each macroblock is composed of four 8x8
luminance blocks and two chrominance blocks (see Fig. 5). In P-pictures each macroblock
can have one motion vector, whereas in B-pictures each macroblock can have as many as
two motion vectors. Each part of this syntax (GOPs, pictures, slices) is encoded with a
special header (as shown in Fig. 6). Programs such as decoders or MERIT parse the
video bit stream of the MPEG file and recognize these headers, which lets them deal
efficiently with the data that follows.
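Header recognition of this kind can be sketched in C as a scan for start codes. In MPEG video streams each header begins with the byte prefix 00 00 01 followed by a code byte (0xB3 for a sequence header, 0xB8 for a GOP header, 0x00 for a picture header, 0x01 to 0xAF for slices). The function and type names below are hypothetical, and this is not MERIT's actual parser.

```c
#include <stddef.h>

/* Hypothetical names; a minimal sketch, not MERIT's parser. */
typedef enum {
    HDR_NONE, HDR_SEQUENCE, HDR_GOP, HDR_PICTURE, HDR_SLICE, HDR_OTHER
} header_kind_t;

/* Classify the code byte that follows a 00 00 01 start-code prefix. */
header_kind_t classify_start_code(unsigned char code)
{
    if (code == 0xB3) return HDR_SEQUENCE;
    if (code == 0xB8) return HDR_GOP;
    if (code == 0x00) return HDR_PICTURE;
    if (code >= 0x01 && code <= 0xAF) return HDR_SLICE;
    return HDR_OTHER;  /* extension, user data, end code, ... */
}

/* Find the next start code at or after *pos.
 * On success, *pos is advanced just past the code byte and the header
 * kind is returned. If no start code remains, *pos is set to len and
 * HDR_NONE is returned. */
header_kind_t next_header(const unsigned char *buf, size_t len, size_t *pos)
{
    size_t i;
    for (i = *pos; i + 3 < len; i++) {
        if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01) {
            *pos = i + 4;
            return classify_start_code(buf[i + 3]);
        }
    }
    *pos = len;
    return HDR_NONE;
}
```

Calling next_header repeatedly walks the stream header by header, so a tool can skip directly to the layers it needs without decoding the block data in between.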

A.5 The MPEG-2 standard

MPEG-2 is an extension of the MPEG-1 international standard for digital compression of
audio and video signals. MPEG-2 is directed at broadcast formats at higher data rates;
it provides extra algorithmic tools for efficiently coding interlaced video, supports a wide
range of bit rates, and provides for multichannel surround sound coding. It is also known
as ISO/IEC 13818.

The MPEG-2 standard is capable of coding standard-definition television at bit rates
of about 3 to 15 Mbit/s and high-definition television at 15 to 30 Mbit/s. Since MPEG-2
is a superset of MPEG-1, MPEG-2 decoders will also decode MPEG-1 bit streams.

[Figure 6 diagram, from the outermost to the innermost layer:
Sequence layer (containing p pictures): start code, sequence parameters, (optional)
quantisation weighting matrix, profile and level, picture 0, picture 1, ..., picture p-1.
Picture layer (containing m slices): start code, picture flags, slice 0, slice 1, ...
Slice layer (containing n macroblocks): start code, slice address, quantisation value,
macroblock 0, macroblock 1, ..., macroblock n-1.
Macroblock layer (containing four luminance and two chrominance blocks for 4:2:0 video):
macroblock address, mode, (optional) quantisation value, motion vectors, coded block
pattern, luminance blocks, chrominance blocks.
Block layer: quantised DCT coefficients for one 8x8 block (variable length coded).]

Figure 6: Bit stream structure as specified by MPEG [8]. Each picture is divided into
m horizontal slices, each comprising n macroblocks. For 4:2:0 video, each macroblock
contains four luminance and two chrominance 8x8 blocks of quantized DCT coefficients.
The profile and level indication appears only in MPEG-2 files.


A.5.1 Profiles and levels

The implementation of the full syntax of MPEG-2 may not be practical for most
applications. MPEG-2 has introduced the concept of "Profiles" and "Levels" to stipulate
conformance between equipment not supporting the full implementation. Profiles and
levels provide means for defining subsets of the syntax and thus the decoder capabilities
required to decode a particular bit stream. A profile is a subset of algorithmic tools and a
level identifies a set of constraints on parameter values (such as picture size and bit rate).
The specifications of each level and each profile are described in Table 1 and Table 2.
A decoder that supports a particular profile and level is only required to support the
corresponding subset of the full standard and set of parameter constraints. The main
profile at the main level is the most common type and is referred to as MP@ML.

Currently, the major interest is in the main profile at the main level for applications
such as digital television broadcasting (terrestrial, satellite and cable), video-on-demand
services, and desktop video systems.
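As an illustration of how a level constrains parameter values, the upper bounds of Table 1 can be placed in a lookup table and searched for the lowest level that accommodates a given stream. This is a sketch under simplifying assumptions: the type and function names are hypothetical, frame rate and bit rate are rounded to integers, and profile restrictions are ignored.

```c
#include <stddef.h>

/* Upper bounds per level, copied from Table 1. Names are hypothetical. */
typedef struct {
    const char *name;
    int max_samples_per_line;
    int max_lines_per_frame;
    int max_frames_per_sec;
    int max_mbit_per_sec;
} level_bounds_t;

static const level_bounds_t LEVELS[] = {
    { "Low",        352,  288, 30,  4 },
    { "Main",       720,  576, 30, 15 },
    { "High 1440", 1440, 1152, 60, 60 },
    { "High",      1920, 1152, 60, 80 }
};

/* Return the name of the lowest level whose bounds accommodate the
 * given parameters, or NULL if even the High level is exceeded. */
const char *lowest_level(int samples, int lines, int fps, int mbit)
{
    size_t i;
    for (i = 0; i < sizeof LEVELS / sizeof LEVELS[0]; i++) {
        const level_bounds_t *l = &LEVELS[i];
        if (samples <= l->max_samples_per_line &&
            lines   <= l->max_lines_per_frame &&
            fps     <= l->max_frames_per_sec &&
            mbit    <= l->max_mbit_per_sec)
            return l->name;
    }
    return NULL;
}
```

For example, a 720 x 576 stream at 25 frames/s and 6 Mbit/s fits within the Main level, which is why such material is typically coded as MP@ML.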

A.5.2  MPEG-2  MAIN  Profile  and  MPEG-1 

The MPEG-2 algorithm defined in the MAIN Profile is a straightforward extension of
the MPEG-1 coding scheme to accommodate coding of interlaced video, while retaining
the full range of functionality provided by MPEG-1. As in the MPEG-1 standard,
the MPEG-2 coding algorithm is based on the general hybrid DCT/DPCM coding
scheme, incorporating a macroblock structure, motion compensation, and coding modes
for conditional replenishment of macroblocks. The concept of I-pictures, P-pictures and
B-pictures is fully retained in MPEG-2 to achieve efficient motion prediction and to assist
random-access functionality.

Level        Samples/line   Lines/frame   Frames/s   Bit rate

High         1920           1152          60         80 Mbit/s
High 1440    1440           1152          60         60 Mbit/s
Main         720            576           30         15 Mbit/s
Low          352            288           30         4 Mbit/s

Table 1: Upper bounds of parameters at each level of a profile.

Profile            Algorithms

High               Supports all functionality provided by the Spatial Scalable Profile
                   plus the provision to support three layers with the SNR and Spatial
                   scalable coding modes. 4:2:2 YUV-representation for improved
                   quality requirements.

Spatial scalable   Supports all functionality provided by the SNR Scalable Profile plus
                   an algorithm for Spatial scalable coding (two layers allowed). 4:2:0
                   YUV-representation.

SNR scalable       Supports all functionality provided by the Main Profile plus an
                   algorithm for SNR scalable coding (two layers allowed). 4:2:0
                   YUV-representation.

Main               Non-scalable coding algorithm supporting functionality for coding
                   interlaced video, random access and B-picture prediction modes.
                   4:2:0 YUV-representation.

Simple             Includes all functionality provided by the Main Profile but does not
                   support B-picture prediction modes. 4:2:0 YUV-representation.

Table 2: Algorithms and functionalities supported by each profile.

A.5.3 Interlaced video

The MPEG-1 standard deals only with progressive video, in which all the pixels of a
frame are captured at the same instant, as in film. The original objective of MPEG-2
was to efficiently code interlaced video, which is mainly used in television. Television
services in the United States currently broadcast video at a frame rate of just under
30 Hz (29.97 Hz). Each frame consists of two interlaced fields, giving a field rate of
approximately 60 Hz. The first field of each frame (the top field) contains only the
odd-numbered lines of the frame (numbering the top frame line as line 1). The second
field (the bottom field) contains only the even-numbered lines of the frame and is sampled
in the video camera one field period (about 16.7 ms) after the first field. It is important
to note that one interlaced frame contains fields from two instants in time. European
television is similarly interlaced but with a frame rate of 25 Hz, giving a field period
of 20 ms.
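The relation between a frame and its two fields can be made concrete with a small C sketch that separates the lines of an interlaced frame. The function name and the one-byte-per-sample layout are assumptions made for illustration.

```c
#include <string.h>

/* Split an interlaced frame into its two fields.
 * frame  : height lines of width samples each, stored line after line
 * top    : receives frame lines 1, 3, 5, ... (array indices 0, 2, 4, ...)
 * bottom : receives frame lines 2, 4, 6, ... (array indices 1, 3, 5, ...)
 * height is assumed to be even, so each field holds height / 2 lines. */
void split_fields(const unsigned char *frame, int width, int height,
                  unsigned char *top, unsigned char *bottom)
{
    int y;
    for (y = 0; y < height; y++) {
        unsigned char *dst = (y % 2 == 0) ? top : bottom;
        memcpy(dst + (size_t)(y / 2) * width,
               frame + (size_t)y * width,
               (size_t)width);
    }
}
```

Because the two fields are captured at different instants, moving objects appear displaced between `top` and `bottom`, which is the effect the field-based coding tools described below are designed to handle.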

MPEG-2 introduced the concept of frame pictures and field pictures along with
particular frame prediction and field prediction modes to accommodate coding of progressive
and interlaced video. For interlaced sequences it is assumed that the coder input consists
of a series of odd and even fields that are separated in time by a field period. Two fields of
a frame may be coded separately (field pictures). In this case each field is separated into
adjacent non-overlapping macroblocks and the DCT is applied on a field basis.
Alternatively, two fields may be coded together as a frame (frame pictures), as in conventional
coding of progressive video sequences. Here, consecutive lines of the top and bottom
fields are simply merged to form a frame. It is worth noting that both frame pictures
and field pictures can be used in a single video sequence.

A.5.4 Picture types

The MPEG-2 syntax specifies the different types of pictures that may be coded and
displayed [9]. Several variables are used to define a picture type, as shown in Fig. 7.
Here we will focus only on the semantic meaning of the relevant variables.

• Sequence level:

progressive_sequence is a 1-bit integer that indicates whether the sequence
contains only progressive frame pictures (1) or not (0). This is the primary
switch between interlaced and progressive video sources, and gives the display
mode (progressive or interlaced).

• Picture level:

progressive_frame (1-bit integer): 1 indicates that the two fields of the frame
correspond to the same instant, as in film. It gives the video capture
mode (progressive or interlaced video). If progressive_sequence = 1 then
we must have progressive_frame = 1. That is, an originally interlaced frame
cannot be displayed as a progressive frame. Nevertheless, MPEG-2 allows for
progressive coded pictures with interlaced display (progressive_frame = 1 and
progressive_sequence = 0).

picture_structure defines the way a picture is internally coded. This 2-bit integer
takes the following values: 3 (11) for a frame picture, 1 (01) for a top-field
picture, and 2 (10) for a bottom-field picture. In the case of frame pictures,
the two fields are interleaved to form a frame, and for field pictures, the two
fields are coded separately. In this case, the picture coding type (I, P, B) is
the same for the two fields of the same frame, except for I-pictures, where the
second field can be a P-picture. If progressive_frame = 1 (progressive coded
picture), the picture must have a frame structure.

top_field_first and repeat_first_field are two indicators whose meaning
depends on the values of progressive_sequence and picture_structure. If
progressive_sequence = 0 and picture_structure = 3 (frame), then
top_field_first indicates which of the two fields must be displayed first
(top: 1, bottom: 0) and repeat_first_field indicates whether the first field
should be repeated to respect the top-bottom display. Note that repeat_first_field
cannot be equal to 1 if progressive_frame = 0. If picture_structure = 1
or 2 (field pictures), then top_field_first = 0 and repeat_first_field = 0,
and the display order is first coded, first displayed. If progressive_sequence
= 1 then the two bits [top_field_first, repeat_first_field] give the
number of times the progressive picture is displayed (00: 1 frame (MP@ML),
01: 2 frames, 11: 3 frames).

To summarize, progressive_sequence gives the display mode (interlaced/progressive),
progressive_frame the original type of picture (interlaced/progressive), and
picture_structure the picture storage mode (frame picture/field picture).
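The constraints between these variables can be collected into a small consistency check. The following C sketch encodes only the rules stated in this section; the struct and function names are hypothetical, and rejecting the bit combination 10 in a progressive sequence is an assumption based on the fact that only 00, 01 and 11 are listed above.

```c
/* Field names follow the MPEG-2 variables discussed in the text;
 * the struct and function names are hypothetical. */
typedef struct {
    int progressive_sequence;  /* 1: only progressive frame pictures */
    int progressive_frame;     /* 1: both fields from the same instant */
    int picture_structure;     /* 1: top field, 2: bottom field, 3: frame */
    int top_field_first;
    int repeat_first_field;
} picture_flags_t;

enum { PS_TOP_FIELD = 1, PS_BOTTOM_FIELD = 2, PS_FRAME = 3 };

/* Return 1 if the flags satisfy the constraints stated in the text. */
int picture_flags_valid(const picture_flags_t *p)
{
    if (p->picture_structure < PS_TOP_FIELD ||
        p->picture_structure > PS_FRAME)
        return 0;
    /* A progressive sequence contains only progressive frames. */
    if (p->progressive_sequence && !p->progressive_frame)
        return 0;
    /* A progressive coded picture must have a frame structure. */
    if (p->progressive_frame && p->picture_structure != PS_FRAME)
        return 0;
    /* Field pictures carry both indicators set to 0. */
    if (p->picture_structure != PS_FRAME &&
        (p->top_field_first || p->repeat_first_field))
        return 0;
    /* repeat_first_field requires a progressive frame. */
    if (!p->progressive_frame && p->repeat_first_field)
        return 0;
    /* Assumption: in a progressive sequence only 00, 01, 11 are used. */
    if (p->progressive_sequence &&
        p->top_field_first && !p->repeat_first_field)
        return 0;
    return 1;
}

/* Number of times a picture of a progressive sequence is displayed,
 * or 0 if the flags are invalid or the sequence is not progressive. */
int progressive_display_count(const picture_flags_t *p)
{
    if (!p->progressive_sequence || !picture_flags_valid(p))
        return 0;
    if (!p->repeat_first_field)
        return 1;                       /* bits 00 */
    return p->top_field_first ? 3 : 2;  /* bits 11 or 01 */
}
```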

A.5.5 Frame and field predictions

New motion-compensated field prediction modes were introduced by MPEG-2 to
efficiently encode field pictures and frame pictures. In field prediction, predictions are made
independently for each field by using data from one or more previously decoded fields,
i.e. for a top field a prediction may be obtained from either a previously decoded top
field (using motion compensated prediction) or from the previously decoded bottom field
belonging to the same picture. An indication of which reference field is used for
prediction is transmitted with the bit stream. Within a field picture all predictions are field
predictions.

Frame prediction makes a prediction for a frame picture based on one or more
previously decoded frames. In a frame picture either field or frame predictions may be used, and
the prediction mode can be selected on a macroblock-by-macroblock basis. Here is a
description of each possible prediction type.

Figure 7: MPEG-2 picture types and corresponding prediction modes and dct types.

Figure 8: Field prediction for frame pictures. Target macroblocks are split into top-field
pixels and bottom-field pixels.


• Frame pictures

Frame prediction is exactly the same as in MPEG-1. Each macroblock has up
to one motion vector (forward prediction) in P-frames and up to two motion
vectors (forward and backward) in B-frames.

Field prediction for frame pictures is a prediction mode where the target
macroblock is first split into top-field pixels and bottom-field pixels,
constituting two 16 x 8 "field macroblocks" (see Fig. 8). Then, for each of these
two half-MBs, the prediction half-MB(s) is (are) found in previous reference
fields. For P-frame pictures, the prediction half-MB may come from either
field of the two most recently coded I- or P-frames. For B-frame pictures, the
backward prediction half-MB is taken from either field of the most recently
coded I- or P-frame, and the forward prediction half-MB from either field of
the last but one I- or P-frame. To select the prediction field used, the encoder
sets the motion_vertical_field_select[r][s] variable as described below. Up
to two motion vectors are assigned to each MB in a P-frame picture (one for
each half-field-MB), and up to four in a B-frame picture.

motion_vertical_field_select[r][s] gives the field selected for the
prediction. Index r=0 indicates the first MV (first half-MB) and r=1 indicates the
second MV (second half-MB). Index s=0 indicates the forward MV, s=1 the
backward MV. A value of 0 indicates the top field, 1 indicates the bottom field.

Dual-Prime mode is only used for P-pictures. With this mode there is only
one motion vector, from which two preliminary predictions are computed.
The first preliminary prediction is identical to frame prediction, except that
each prediction pixel must have the same parity as the target pixel. The
second preliminary prediction is derived using a computed motion vector plus
a small differential motion vector (dmvector). The computed motion vector is
obtained by a temporal scaling of the transmitted motion vector, and for the
final corrected motion vector, each prediction pixel has opposite parity to the
target pixel. The two preliminary predictions are then averaged together to
form the final prediction.

• Field pictures

Field prediction for field pictures is similar to field prediction for frame
pictures, except that there is no half-MB (in field pictures all the pixels belong
to the same field). So for a particular MB, all pixels have the same parity
(i.e. they all come from the same field). For P-field pictures, the prediction
MB may come from either of the two most recently coded I- or P-fields. When
coding the second field of a frame, this includes the first field of the same
frame. For B-field pictures, the backward prediction MB is taken
from either field of the most recently coded I- or P-frame, and the forward
prediction MB from either field of the last but one I- or P-frame. For field
selection, the motion_vertical_field_select[r][s] variable is also used, but with
r=0 only (only one MB). Up to one motion vector is assigned to each MB in
a P-field picture and up to two in a B-field picture.

16 x 8 MC prediction mode is similar to field prediction for frame pictures: the
target MB is split into an upper half and a lower half, and a separate field
prediction is performed for each half, as in field prediction for frame pictures.
Up to two motion vectors are assigned to each MB in a P-field picture (one
for each half-MB), and up to four in a B-field picture.

Dual-Prime mode for P-pictures is the same as in frame pictures except that for
the two preliminary predictions, reference pixels are taken from only one field:
the same parity as the target MB for the first prediction, and the opposite
parity for the second prediction.

There are three motion modes for each picture type (frame or field), one using one
motion vector for P-pictures and two for B-pictures (frame prediction for frame pictures,
field prediction for field pictures), another using two motion vectors for P-pictures and
four for B-pictures (field prediction for frame pictures, 16 x 8 MC for field pictures), and
the last using one motion vector and a dmvector (Dual-Prime for P-pictures). It is worth
noting that the macroblock type seems to be chosen before the motion compensation
mode in most encoders.
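The motion vector counts summarized above can be expressed as a small lookup function. This is a sketch with hypothetical enum and function names; since the counts are the same for frame and field pictures, the picture structure only determines which prediction mode each enum value stands for.

```c
/* Hypothetical names; the counts are those summarized in the text. */
typedef enum { PIC_P, PIC_B } pic_coding_t;

typedef enum {
    MC_SINGLE,     /* frame prediction (frame pictures) or
                      field prediction (field pictures) */
    MC_SPLIT,      /* field prediction (frame pictures) or
                      16 x 8 MC (field pictures) */
    MC_DUAL_PRIME  /* Dual-Prime, P-pictures only */
} mc_mode_t;

/* Return the maximum number of transmitted motion vectors per
 * macroblock, or -1 if the mode is not allowed for the picture type.
 * The Dual-Prime dmvector is not counted as a motion vector here. */
int max_motion_vectors(pic_coding_t coding, mc_mode_t mode)
{
    switch (mode) {
    case MC_SINGLE:
        return coding == PIC_P ? 1 : 2;
    case MC_SPLIT:
        return coding == PIC_P ? 2 : 4;
    case MC_DUAL_PRIME:
        return coding == PIC_P ? 1 : -1;
    }
    return -1;
}
```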

A.5.6 Frame-dct and field-dct coding

MPEG-2 provides another feature for dealing with interlaced pictures. For frame pictures,
on a macroblock-by-macroblock basis, the dct_type can be set as frame_dct (0) or field_dct
(1). The frame dct type is the same dct coding as in MPEG-1. With the field dct type,
just prior to performing the DCT, the encoder may reorder the luminance lines within a
MB so that the first 8 lines come from the top field, and the last 8 lines come from the
bottom field. The decoder undoes this reordering just after the inverse DCT. With field_dct
in an interlaced frame picture, the vertical correlation within the luminance blocks is
increased.

Figure 9: Alternate scan of the DCT coefficients of an 8 x 8 block.
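The luminance line reordering used with the field dct type can be sketched in C as follows. The function names and the flat 16 x 16 array layout are assumptions made for illustration.

```c
#include <string.h>

#define MB_SIZE 16  /* a macroblock covers 16 x 16 luminance samples */

/* Map a frame-order row index to its position in field order:
 * top-field rows (0, 2, 4, ...) go to rows 0..7,
 * bottom-field rows (1, 3, 5, ...) go to rows 8..15. */
static int field_order_row(int y)
{
    return (y % 2 == 0) ? y / 2 : MB_SIZE / 2 + y / 2;
}

/* Encoder side: reorder luminance lines before the DCT (field dct). */
void frame_to_field_order(const unsigned char *in, unsigned char *out)
{
    int y;
    for (y = 0; y < MB_SIZE; y++)
        memcpy(out + field_order_row(y) * MB_SIZE,
               in + y * MB_SIZE, MB_SIZE);
}

/* Decoder side: restore the interleaved line order after the
 * inverse DCT. */
void field_to_frame_order(const unsigned char *in, unsigned char *out)
{
    int y;
    for (y = 0; y < MB_SIZE; y++)
        memcpy(out + y * MB_SIZE,
               in + field_order_row(y) * MB_SIZE, MB_SIZE);
}
```

After the reordering, each 8 x 8 luminance block contains lines from a single field only, so adjacent lines within a block were sampled at the same instant.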

For each frame, frame_pred_frame_dct acts as a kind of shortcut flag. If
frame_pred_frame_dct = 1, then only frame prediction and frame DCT are used within
the frame. If frame_pred_frame_dct = 0 then all the motion compensation modes and
all the DCT types can be used. If progressive_frame = 1 then frame_pred_frame_dct
= 1, and if picture_structure = 1 or 2, then frame_pred_frame_dct = 0.

In fact, for each picture type (as described before), only specific types of compensation
modes and dct coding are allowed (see also Fig. 7).

A.5.7 Alternate scan

The main effect of interlace in frame pictures is that since adjacent scan lines come from
different fields, vertical correlation is reduced when there is motion in the scene. With
reduced vertical correlation, the zig-zag scanning order is no longer optimal. That is
the reason why MPEG-2 has an Alternate-Scan mode, shown in Fig. 9. The type of scan
may be specified by the encoder on a picture-by-picture basis.
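Coefficient scanning is naturally table-driven, which is what makes a per-picture choice of scan inexpensive. The C sketch below generates the conventional zig-zag order and applies an arbitrary scan table to a block; the function names are hypothetical, and the Alternate-Scan table itself is defined by the MPEG-2 standard and is not reproduced here.

```c
/* Fill order[k] with the raster index (row * 8 + col) of the k-th
 * coefficient visited by the conventional zig-zag scan. */
void build_zigzag(int order[64])
{
    int k = 0, s, i;
    for (s = 0; s <= 14; s++) {          /* s = row + col: one diagonal */
        int lo = (s < 8) ? 0 : s - 7;    /* smallest row on the diagonal */
        int hi = (s < 8) ? s : 7;        /* largest row on the diagonal */
        for (i = lo; i <= hi; i++) {
            /* Even diagonals are traversed with the row decreasing,
             * odd diagonals with the row increasing. */
            int row = (s % 2 == 0) ? hi + lo - i : i;
            order[k++] = row * 8 + (s - row);
        }
    }
}

/* Reorder the 64 coefficients of a block according to a scan table.
 * Passing a different table (e.g. the alternate scan) changes the
 * scan without changing this function. */
void scan_block(const int block[64], const int order[64], int out[64])
{
    int k;
    for (k = 0; k < 64; k++)
        out[k] = block[order[k]];
}
```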

A.5.8 Chrominance formats

MPEG-2 specifies additional Y:U:V luminance and chrominance subsampling ratio
formats to assist and foster applications with high video quality requirements. In addition to
the 4:2:0 format already supported by MPEG-1, the specification of MPEG-2 is extended
to 4:2:2 and 4:4:4 formats suitable for studio video coding applications. 4:2:2 means the
chrominance is horizontally subsampled by a factor of two relative to the luminance;
4:2:0 (sometimes loosely called 4:1:1) means the chrominance is horizontally and vertically
subsampled by a factor of two relative to the luminance; 4:4:4 means the chrominance is
not subsampled. In the MAIN Profile at MAIN Level, only the 4:2:0 format is allowed.
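The effect of each chrominance format on plane sizes and on the number of 8x8 blocks per macroblock can be worked out in a few lines of C. The names below are hypothetical, and even luminance dimensions are assumed.

```c
typedef enum { CHROMA_420, CHROMA_422, CHROMA_444 } chroma_format_t;

/* Compute the size of each chrominance plane from the luminance size.
 * 4:2:0 halves both dimensions, 4:2:2 halves the width only,
 * 4:4:4 keeps the full resolution. */
void chroma_plane_size(chroma_format_t f, int luma_w, int luma_h,
                       int *chroma_w, int *chroma_h)
{
    *chroma_w = (f == CHROMA_444) ? luma_w : luma_w / 2;
    *chroma_h = (f == CHROMA_420) ? luma_h / 2 : luma_h;
}

/* Number of 8x8 blocks in one macroblock: four luminance blocks plus
 * the blocks of the two chrominance components. */
int blocks_per_macroblock(chroma_format_t f)
{
    int cw, ch;
    chroma_plane_size(f, 16, 16, &cw, &ch);
    return 4 + 2 * (cw / 8) * (ch / 8);
}
```

This gives 6 blocks per macroblock for 4:2:0, 8 for 4:2:2 and 12 for 4:4:4, which is why the higher chrominance formats change the macroblock layout that a parser has to handle.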
A.5.9 MPEG-2 scalability techniques

The scalability tools standardized by MPEG-2 support applications beyond those
addressed by the basic MAIN Profile coding algorithm. The intention of scalable coding
is to provide interoperability between different services and to flexibly support receivers
with different display capabilities. Receivers either unable or unwilling to reconstruct
the full-resolution video can decode subsets of the layered bit stream to display video at
lower spatial or temporal resolution or with lower quality. Another important purpose of
scalable coding is to provide a layered video bit stream which is amenable to prioritized
transmission.

For  instance,  two  layers  can  be  provided,  each  layer  supporting  video  at  a  different 
scale,  i.e.  a  multiresolution  representation  can  be  achieved  by  downscaling  the  input 
video  signal  into  a  lower-resolution  video  (downsampling  spatially  or  temporally).  The 
downscaled  version  is  encoded  into  a  base  layer  bit  stream  with  a  reduced  bit  rate.  The 
upscaled  reconstructed  base  layer  video  (upsampled  spatially  or  temporally)  is  used  as 
a  prediction  for  the  coding  of  the  original  input  video  signal.  The  prediction  error  is 
encoded  into  an  enhancement  layer  bit  stream.  If  a  receiver  is  either  unable  or  unwilling 
to  display  the  full-quality  video,  a  downscaled  video  signal  can  be  reconstructed  by 
decoding  only  the  base  layer  bit  stream.  Thus  scalable  coding  can  be  used  to  encode 
video  with  a  suitable  bit  rate  allocated  to  each  layer  in  order  to  meet  specific  bandwidth 
requirements of transmission channels or storage media. Browsing through video
databases and transmission of video over heterogeneous networks are applications expected
to benefit from this functionality.
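The two-layer principle can be illustrated on a one-dimensional signal. The C sketch below is a deliberate simplification: it uses pair averaging for downscaling and sample repetition for upscaling, omits the DCT, quantization and entropy coding of both layers, and all function names are hypothetical.

```c
/* Base layer: downscale by a factor of two by averaging sample pairs.
 * n must be even; base receives n / 2 samples. */
void downscale2(const int *in, int n, int *base)
{
    int i;
    for (i = 0; i < n / 2; i++)
        base[i] = (in[2 * i] + in[2 * i + 1] + 1) / 2;
}

/* Prediction: upscale the base layer by repeating each sample. */
void upscale2(const int *base, int n_base, int *pred)
{
    int i;
    for (i = 0; i < n_base; i++)
        pred[2 * i] = pred[2 * i + 1] = base[i];
}

/* Enhancement layer: prediction error between the original signal
 * and the upscaled base layer. */
void enhancement_residual(const int *in, const int *pred, int n, int *res)
{
    int i;
    for (i = 0; i < n; i++)
        res[i] = in[i] - pred[i];
}
```

A receiver that decodes only `base` obtains the low-resolution signal. A receiver that also decodes the residual adds it to the upscaled base layer and, in this lossless sketch, recovers the original samples exactly.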

A.5.10 The MPEG-2 video stream syntax

The MPEG-2 video standard specifies the syntax and semantics of the compressed video
stream produced by the video encoder. Most of MPEG-2 consists of additions to MPEG-1.
The video stream syntax is flexible to support the variety of applications envisaged
for the MPEG-2 video standard. Like MPEG-1, the syntax is constructed in a
hierarchy of headers, which are: Video sequence header, Group of Pictures header, Picture
header, Slice header and Macroblock header. The block contains the DCT coefficients
(see Fig. 5 and Fig. 6). Useful information about these headers can be found in the
MPEG specifications [9].
B Appendix: UML diagrams


These UML diagrams (Figs. 10 to 13) provide further explanations about the structure
of MERIT 4.0. Since MERIT is not an object-oriented application, class diagrams are
not used here. The structure diagrams show the structure of MERIT and of typical
MPEG files, and the collaboration diagrams depict the interactions between the different
components of MERIT.
Figure 11: MPEG file structure diagram.

Figure 13: Processing collaboration diagram.